1 CS/ECE/ISyE 524 Introduction to Optimization (Spring): NLP algorithms
- Overview
- Local methods
- Constrained optimization
- Global methods
- Black-box methods
- Course wrap-up
Laurent Lessard
2 Review of algorithms
Studying Linear Programs, we talked about:
- Simplex method: traverse the surface of the feasible polyhedron looking for the vertex with minimum cost. Only applicable to linear programs. Used by solvers such as Clp and CPLEX; hybrid versions are used by Gurobi and Mosek.
- Interior point methods: traverse the inside of the feasible polyhedron and move toward the boundary point with minimum cost. Applicable to many different types of optimization problems. Used by SCS, ECOS, and Ipopt.
3 Review of algorithms
Studying Mixed Integer Programs, we talked about:
- Cutting plane methods: solve a sequence of LP relaxations and keep adding cuts (special extra linear constraints) until the solution is integral, and therefore optimal. Also applicable to more general convex problems.
- Branch and bound methods: solve a sequence of LP relaxations (upper bounding), and branch on fractional variables (lower bounding). Store subproblems in a tree and prune branches that aren't fruitful. Most optimization problems can be solved this way: you just need a way to branch (split the feasible set) and a way to bound (efficiently relax).
- Variants of the methods above are used by all MIP solvers.
4 Overview of NLP algorithms
To solve Nonlinear Programs with continuous variables, there is a wide variety of available algorithms. We'll assume the problem has the standard form:

    minimize_x   f_0(x)
    subject to:  f_i(x) ≤ 0   for i = 1, ..., m

What works best depends on the kind of problem you're solving, so we need to talk about problem categories.
5 Overview of NLP algorithms
1. Are the functions differentiable? Can we efficiently compute gradients or second derivatives of the f_i?
2. What problem size are we dealing with? A few variables and constraints? Hundreds? Thousands? Millions?
3. Do we want to find local optima, or do we need the global optimum (more difficult!)?
4. Does the objective function have a large number of local minima, or a relatively small number?
Note: items 3 and 4 don't matter if the problem is convex. In that case, any local minimum is also a global minimum!
6 Survey of NLP algorithms
- Local methods using derivative information. This is what most NLP solvers use (and what most JuMP solvers use).
  - unconstrained case
  - constrained case
- Global methods
- Derivative-free methods
7 Local methods using derivatives
Let's start with the unconstrained case:

    minimize_x  f(x)

Many methods available! Roughly ordered from cheap and slow to expensive and fast:

    stochastic gradient descent → gradient descent → accelerated methods → conjugate gradient → quasi-Newton methods → Newton's method
8 Iterative methods
Local methods iteratively step through the space looking for a point where ∇f(x) = 0:
1. pick a starting point x_0
2. choose a direction to move in, Δ_k. This is the part where different algorithms do different things.
3. update your location: x_{k+1} = x_k + Δ_k
4. repeat until you're happy with the function value or the algorithm has ceased to make progress.
9 Vector calculus
Suppose f : Rⁿ → R is a twice-differentiable function.
- The gradient of f is a function ∇f : Rⁿ → Rⁿ defined by [∇f]_i = ∂f/∂x_i. The vector ∇f(x) points in the direction of greatest increase of f at x.
- The Hessian of f is a function ∇²f : Rⁿ → R^{n×n} defined by [∇²f]_{ij} = ∂²f/(∂x_i ∂x_j). The matrix ∇²f(x) encodes the curvature of f at x.
10 Vector calculus
Example: suppose f(x, y) = x² + 3xy + 5y² - 7x + 2. Then:

    ∇f = [∂f/∂x; ∂f/∂y] = [2x + 3y - 7; 3x + 10y]

    ∇²f = [∂²f/∂x²  ∂²f/∂x∂y; ∂²f/∂x∂y  ∂²f/∂y²] = [2  3; 3  10]

Taylor's theorem in n dimensions:

    f(x) ≈ f(x_0) + ∇f(x_0)ᵀ(x - x_0)                                (best linear approximation)
    f(x) ≈ f(x_0) + ∇f(x_0)ᵀ(x - x_0) + ½(x - x_0)ᵀ∇²f(x_0)(x - x_0)  (best quadratic approximation)
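As a quick sanity check on these formulas (a sketch assuming the ForwardDiff automatic-differentiation package is available; it is not part of the slides):

    using ForwardDiff

    # f(x, y) = x² + 3xy + 5y² - 7x + 2, written with a vector argument v = [x, y]
    f(v) = v[1]^2 + 3v[1]*v[2] + 5v[2]^2 - 7v[1] + 2

    ForwardDiff.gradient(f, [1.0, 2.0])   # [2(1) + 3(2) - 7, 3(1) + 10(2)] = [1.0, 23.0]
    ForwardDiff.hessian(f, [1.0, 2.0])    # [2.0 3.0; 3.0 10.0], constant for a quadratic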
11 Gradient descent
The simplest of all iterative methods. It's a first-order method, which means it only uses gradient information:

    x_{k+1} = x_k - t_k ∇f(x_k)

- The direction -∇f(x_k) is the direction of local steepest decrease of the function. We will move in this direction.
- t_k is the stepsize. There are many ways to choose it:
  - pick a constant: t_k = t
  - pick a slowly decreasing stepsize, such as t_k = 1/√k
  - exact line search: t_k = argmin_t f(x_k - t ∇f(x_k))
  - a heuristic method (most common in practice), for example backtracking line search
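To make the recipe concrete, here is a minimal Julia sketch of gradient descent with backtracking line search. The function names and the Armijo parameters a and b are illustrative choices, not the course's code:

    using LinearAlgebra

    # Backtracking line search: shrink t until the sufficient-decrease
    # (Armijo) condition f(x - t*g) <= f(x) - a*t*||g||^2 holds.
    function backtracking(f, x, g; t = 1.0, a = 0.3, b = 0.5)
        while f(x - t*g) > f(x) - a*t*dot(g, g)
            t *= b
        end
        return t
    end

    function gradient_descent(f, ∇f, x0; tol = 1e-6, maxiter = 10_000)
        x = copy(x0)
        for k in 1:maxiter
            g = ∇f(x)
            norm(g) < tol && break              # close enough to a stationary point
            x -= backtracking(f, x, g) * g      # step along -∇f(x)
        end
        return x
    end

    # Example: the quadratic from slide 10 (minimum at (70/11, -21/11) ≈ (6.36, -1.91))
    f(v)  = v[1]^2 + 3v[1]*v[2] + 5v[2]^2 - 7v[1] + 2
    ∇f(v) = [2v[1] + 3v[2] - 7, 3v[1] + 10v[2]]
    gradient_descent(f, ∇f, [0.0, 0.0])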
12 Gradient descent
We can gain insight into the effectiveness of a method by seeing how it performs on a quadratic: f(x) = ½xᵀQx. The condition number κ := λ_max(Q)/λ_min(Q) determines convergence.

[Figure: distance to the optimal point vs. number of iterations for the optimal stepsize, a shorter step, and an even shorter step, shown for a well-conditioned problem (κ = 10) and for a larger condition number; convergence degrades as κ grows.]
13 Gradient descent
Advantages:
- Simple to implement and cheap to execute.
- Can be easily adjusted.
- Robust in the presence of noise and uncertainty.
Disadvantages:
- Convergence is slow.
- Sensitive to conditioning: even rescaling a variable can have a substantial effect on performance!
- Not always easy to tune the stepsize.
Note: the idea of preconditioning (rescaling) before solving adds another layer of possible customizations and tradeoffs.
14 Other first-order methods
Accelerated methods (momentum methods)
- Still first-order methods, but they make use of past iterates to accelerate convergence. Example: the heavy-ball method (a sketch follows this slide):

    x_{k+1} = x_k - α_k ∇f(x_k) + β_k (x_k - x_{k-1})

  Other examples: Nesterov, Beck & Teboulle, and others.
- Can achieve substantial improvement over gradient descent with only a moderate increase in computational cost.
- Not as robust to noise as gradient descent, and can be more difficult to tune because there are more parameters.
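A minimal heavy-ball sketch, with constant α and β (the values here are illustrative; in practice they are tuned to the problem's curvature):

    # Heavy-ball iteration: x_{k+1} = x_k - α*∇f(x_k) + β*(x_k - x_prev)
    function heavy_ball(∇f, x0; α = 0.05, β = 0.8, iters = 1000)
        x, xprev = copy(x0), copy(x0)
        for k in 1:iters
            x, xprev = x - α*∇f(x) + β*(x - xprev), x   # RHS uses the old x
        end
        return x
    end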
15 Other first-order methods
Mini-batch stochastic gradient descent (SGD)
- Useful if f(x) = Σ_{i=1}^N f_i(x). Use the direction -Σ_{i∈S} ∇f_i(x_k), where S ⊆ {1, ..., N}. The size of S is the batch size: |S| = 1 is SGD and |S| = N is ordinary gradient descent. (A sketch follows this slide.)
- Same pros and cons as gradient descent, but allows a further tradeoff of speed vs. computation.
- Industry standard for big-data problems like deep learning.
Nonlinear conjugate gradient
- A variant of the standard conjugate gradient algorithm for solving Ax = b, adapted for use in general optimization.
- Requires more computation than accelerated methods.
- Converges exactly in a finite number of steps when applied to quadratic functions.
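A sketch of the mini-batch update, assuming the finite sum is available as a vector of per-term gradient functions (grads[i](x) returning ∇f_i(x) is an assumed interface):

    # Mini-batch SGD for f(x) = Σᵢ fᵢ(x).
    # batch = 1 is plain SGD; batch = length(grads) is ordinary gradient descent.
    function minibatch_sgd(grads, x0; batch = 10, t = 0.01, iters = 1000)
        x = copy(x0)
        for k in 1:iters
            S = rand(1:length(grads), batch)        # sample a batch (with replacement)
            x -= t * sum(grads[i](x) for i in S)
        end
        return x
    end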
16 Newton's method
Basic idea: approximate the function as a quadratic, move directly to the minimum of that quadratic, and repeat.
- If we're at x_k, then by Taylor's theorem:

    f(x) ≈ f(x_k) + ∇f(x_k)ᵀ(x - x_k) + ½(x - x_k)ᵀ∇²f(x_k)(x - x_k)

- If ∇²f(x_k) ≻ 0, the minimum of the quadratic occurs at:

    x_{k+1} := x_opt = x_k - ∇²f(x_k)⁻¹ ∇f(x_k)

- Newton's method is a second-order method; it requires computing the Hessian (second derivatives).
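The corresponding iteration in Julia, as a sketch (the linear system is solved rather than the inverse formed):

    using LinearAlgebra

    # Newton's method: assumes ∇²f(x) stays positive definite along the way.
    function newton(∇f, ∇²f, x0; tol = 1e-10, maxiter = 50)
        x = copy(x0)
        for k in 1:maxiter
            g = ∇f(x)
            norm(g) < tol && break
            x -= ∇²f(x) \ g    # Newton step: solve ∇²f(x)*d = ∇f(x)
        end
        return x
    end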
17 Newton's method in 1D
Example: f(x) = log(e^{x+3} + e^{-2x+2}).

[Figure: the graph of f with the first Newton iterates (x_0, f_0), (x_1, f_1), (x_2, f_2); from this starting point, the iterates converge rapidly to the minimum. Example by L. El Ghaoui, UC Berkeley, EE127a.]
18 Newton's method in 1D
Same example, f(x) = log(e^{x+3} + e^{-2x+2}), but starting from a point farther away: divergent!

[Figure: the iterates (x_0, f_0), (x_1, f_1), x_2 overshoot and move away from the minimum. Example by L. El Ghaoui, UC Berkeley, EE127a.]
19 Newton's method
Advantages:
- It's usually very fast. It converges to the exact optimum in one iteration if the objective is quadratic.
- It's scale-invariant: the convergence rate is not affected by any linear scaling or transformation of the variables.
Disadvantages:
- If n is large, storing the Hessian (an n × n matrix) and computing ∇²f(x_k)⁻¹ ∇f(x_k) can be prohibitively expensive.
- If ∇²f(x_k) ⊁ 0, Newton's method may converge to a local maximum or a saddle point.
- It may fail to converge at all if we start too far from the optimal point.
20 Quasi-Newton methods
- An approximate Newton's method that doesn't require computing the Hessian.
- Uses an approximation H_k ≈ ∇²f(x_k)⁻¹ that can be updated directly and is faster to compute than the full Hessian:

    x_{k+1} = x_k - H_k ∇f(x_k)
    H_{k+1} = g(H_k, ∇f(x_k), x_k)

- Several popular update schemes for H_k:
  - DFP (Davidon-Fletcher-Powell)
  - BFGS (Broyden-Fletcher-Goldfarb-Shanno)
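For concreteness, the standard BFGS update of the inverse-Hessian approximation looks like this (the formula is standard; the surrounding code is a sketch):

    using LinearAlgebra

    # One BFGS update of H ≈ ∇²f(x)⁻¹, given the step s = x⁺ - x and the
    # gradient change y = ∇f(x⁺) - ∇f(x). Requires dot(y, s) > 0 (curvature condition).
    function bfgs_update(H, s, y)
        ρ = 1 / dot(y, s)
        V = I - ρ * (s * y')
        return V * H * V' + ρ * (s * s')
    end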
21 Example
- f(x, y) = e^{-(x-3)/2} + e^{(x+4y)/10} + e^{(x-4y)/10}
- The function is smooth, with a single minimum near (4.03, 0).

[Figure: iterate trajectories in the (x, y) plane for gradient descent, Nesterov's method, BFGS, and Newton's method.]
22 Example
[Figure: distance to the optimal point vs. number of iterations for gradient descent, Nesterov's method, BFGS, and Newton's method.]
- Illustrates the complexity vs. performance tradeoff.
- Nesterov's method doesn't always converge uniformly.
- Julia code: IterativeMethods.ipynb
23 Recap of local methods
Important: for any of the local methods we've seen, if ∇f(x_k) = 0, then x_{k+1} = x_k and we won't move!
Roughly ordered from cheap and slow to expensive and fast:

    stochastic gradient descent → gradient descent → accelerated methods → conjugate gradient → quasi-Newton methods → Newton's method
24 Constrained local optimization
The algorithms we've seen so far are designed for unconstrained optimization. How do we deal with constraints?
- We'll revisit interior point methods, and we'll also talk about a class of algorithms called active set methods.
- These are among the most popular methods for smooth constrained optimization.
25 Interior point methods

    minimize_x   f_0(x)
    subject to:  f_i(x) ≤ 0

Basic idea: augment the objective function using a barrier that goes to infinity as we approach a constraint:

    minimize_x   f_0(x) - μ Σ_{i=1}^m log(-f_i(x))

Then alternate between (1) an iteration of an unconstrained method (usually Newton's) and (2) shrinking μ toward zero.
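A schematic of the outer loop. Here newton_minimize is a hypothetical placeholder for any unconstrained inner solver, and shrinking μ by a factor of 10 per round is an illustrative choice:

    # Barrier method sketch: minimize f0(x) - μ Σ log(-fi(x)) for shrinking μ,
    # warm-starting each inner solve at the previous solution.
    function barrier_method(f0, fs, x0; μ = 1.0, shrink = 0.1, rounds = 5)
        x = copy(x0)
        for r in 1:rounds
            φ = z -> f0(z) - μ * sum(log(-fi(z)) for fi in fs)
            x = newton_minimize(φ, x)   # hypothetical unconstrained solver
            μ *= shrink
        end
        return x
    end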
26 Interior point methods
Example: f_0(x) = ½x with -2 ≤ x ≤ 2.

[Figure: the barrier-augmented objective plotted for μ = 0.5, μ = 0.2, and a smaller value of μ; as μ shrinks, the minimizer moves toward the constrained optimum at the boundary x = -2.]
27 Active set methods

    minimize_x   f_0(x)
    subject to:  f_i(x) ≤ 0

Basic idea: at optimality, some of the constraints will be active (equal to zero); the others can be ignored.
- Given some active set, we can solve or approximate the solution of the simultaneous equalities (constraints not in the active set are ignored). Approximations typically use linear (LP) or quadratic (QP) functions.
- Inequality constraints are then added to or removed from the active set based on certain rules, and the process repeats.
- The simplex method is an example of an active set method.
28 NLP solvers in JuMP
- Ipopt (Interior Point OPTimizer) uses an interior point method to handle constraints. If second-derivative information is available, it uses a sparse Newton iteration; otherwise it uses BFGS or SR1 (another quasi-Newton method). (A JuMP usage sketch follows this slide.)
- Knitro (Nonlinear Interior point Trust Region Optimization) implements four different algorithms. Two are interior point (one is algebraic; the other uses conjugate gradient as the inner solver). The other two are active set (one uses sequential LP approximations; the other uses sequential QP approximations).
- NLopt is an open-source platform that interfaces with many (currently 43) different solvers. Only a handful are currently available in JuMP, but some are global/derivative-free.
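As a usage sketch, the slide-21 example (with the objective as reconstructed there) can be handed to Ipopt through JuMP. This uses the @NLobjective macro from the course era; newer JuMP versions also accept a plain @objective:

    using JuMP, Ipopt

    model = Model(Ipopt.Optimizer)
    @variable(model, x, start = 0.0)
    @variable(model, y, start = 0.0)
    @NLobjective(model, Min,
        exp(-(x - 3)/2) + exp((x + 4*y)/10) + exp((x - 4*y)/10))
    optimize!(model)
    value(x), value(y)    # should be ≈ (4.03, 0.0)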
29 NLopt solvers
Algorithms:
LD_AUGLAG, LD_AUGLAG_EQ, LD_CCSAQ, LD_LBFGS_NOCEDAL, LD_LBFGS, LD_MMA, LD_SLSQP, LD_TNEWTON, LD_TNEWTON_RESTART, LD_TNEWTON_PRECOND, LD_TNEWTON_PRECOND_RESTART, LD_VAR1, LD_VAR2,
LN_AUGLAG, LN_AUGLAG_EQ, LN_BOBYQA, LN_COBYLA, LN_NEWUOA, LN_NEWUOA_BOUND, LN_NELDERMEAD, LN_PRAXIS, LN_SBPLX,
GD_MLSL, GD_MLSL_LDS, GD_STOGO, GD_STOGO_RAND,
GN_CRS2_LM, GN_DIRECT, GN_DIRECT_L, GN_DIRECT_L_RAND, GN_DIRECT_NOSCAL, GN_DIRECT_L_NOSCAL, GN_DIRECT_L_RAND_NOSCAL, GN_ESCH, GN_ISRES, GN_MLSL, GN_MLSL_LDS, GN_ORIG_DIRECT, GN_ORIG_DIRECT_L
- L/G: local/global method
- D/N: derivative-based/derivative-free
- mostly implemented in C++; some work with Julia/JuMP
30 Global methods
A global method makes an effort to find a global optimum rather than just a local one.
- If gradients are available, the standard (and obvious) thing to do is multistart (also known as random restarts); a sketch follows this slide:
  - Randomly pepper the space with initial points.
  - Run your favorite local method starting from each point (these runs can be executed in parallel).
  - Compare the different local minima found.
- The number of restarts required depends on the size of the space and how many local minima it contains.
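A bare-bones multistart sketch, where local_min stands in for any of the local methods above (both names are illustrative):

    # Multistart: run a local method from random points in [-box, box]^n
    # and keep the best local minimum found. The trials are independent,
    # so this loop parallelizes trivially.
    function multistart(f, local_min, n; trials = 100, box = 10.0)
        best_x, best_f = nothing, Inf
        for trial in 1:trials
            x0 = box .* (2 .* rand(n) .- 1)    # uniform start in the box
            x  = local_min(f, x0)
            fx = f(x)
            if fx < best_f
                best_x, best_f = x, fx
            end
        end
        return best_x, best_f
    end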
31 Global methods
A global method makes an effort to find a global optimum rather than just a local one.
- A more sophisticated approach: systematically partition the space using a branch-and-bound technique, and search the smaller spaces using local gradient-based search.
- Knowledge of derivatives is required for both the bounding and local optimization steps.
32 Black-box methods
What if no derivative information is available and all we can do is compute f(x)? We must resort to black-box methods (also known as derivative-free or direct search methods). If f is smooth:
- Approximate the derivative numerically by using finite differences, and then use a standard gradient-based method (a sketch follows this slide).
- Use coordinate descent: pick one coordinate, perform a line search, then pick the next coordinate, and keep cycling.
- Stochastic Approximation (SA), Random Search (RS), and others: pick a random direction, perform a line search, repeat.
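A forward-difference gradient estimate costs one extra function evaluation per coordinate. A sketch (h = √eps() is a common default that balances truncation against roundoff error):

    # Forward-difference approximation of ∇f(x): g[i] = (f(x + h*eᵢ) - f(x)) / h.
    function fd_gradient(f, x; h = sqrt(eps()))
        g  = similar(x)
        fx = f(x)
        for i in eachindex(x)
            xh = copy(x)
            xh[i] += h
            g[i] = (f(xh) - fx) / h
        end
        return g
    end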
33 Black-box methods
What if no derivative information is available and f is not smooth? (You're usually in trouble.)
- Pattern search: search over a grid and refine the grid adaptively in areas where larger variations are observed.
- Genetic algorithms: a randomized approach that simulates a population of candidate points and uses a combination of mutation and crossover at each iteration to generate new candidate points. The idea is to mimic natural selection.
- Simulated annealing: a randomized approach using gradient descent that is perturbed in proportion to a temperature parameter; the simulation continues as the system is progressively cooled. The idea is to mimic physics / crystallization.
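As an illustration, here is a derivative-free, Metropolis-style variant of simulated annealing (a sketch; the slide describes a perturbed gradient-descent version, but the accept/reject rule with a cooling temperature is the same idea):

    # Simulated annealing sketch: always accept downhill moves; accept uphill
    # moves with probability exp(-Δf/T); cool the temperature geometrically.
    function simulated_annealing(f, x0; T = 1.0, cool = 0.99, σ = 0.1, iters = 10_000)
        x, fx = copy(x0), f(x0)
        for k in 1:iters
            y  = x .+ σ .* randn(length(x))    # random candidate move
            fy = f(y)
            if fy < fx || rand() < exp(-(fy - fx)/T)
                x, fx = y, fy
            end
            T *= cool
        end
        return x, fx
    end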
34 Optimization at UW-Madison
- Linear programming and related topics
  - CS 525: linear programming methods
  - CS 526: advanced linear programming
- Convex optimization and iterative algorithms
  - CS 726: nonlinear optimization I
  - CS 727: nonlinear optimization II
  - CS 727: convex analysis
- MIP and combinatorial optimization
  - CS 425: introduction to combinatorial optimization
  - CS 577: introduction to algorithms
  - CS 720: integer programming
  - CS 728: integer optimization
35 External resources
Continuous optimization
- Lieven Vandenberghe (UCLA): vandenbe/
- Stephen Boyd (Stanford): boyd/
- Ryan Tibshirani (CMU): ryantibs/convexopt/
- L. El Ghaoui (Berkeley): elghaoui/
Discrete optimization
- Dimitris Bertsimas (MIT): integer programming
- AM121 (Harvard): intro to optimization